Compressed Perfect Embedded Skip Lists for Quick Inverted-Index Lookups

نویسندگان

  • Paolo Boldi
  • Sebastiano Vigna
چکیده

Large inverted indices are by now common in the construction of web-scale search engines. For faster access, inverted indices are indexed internally so that it is possible to skip quickly over unnecessary documents. The classical approach to skipping dictates that a skip should be positioned every √ f document pointers, where f is the overall number of documents where the term appears. We argue that due to the growing size of the web more refined techniques are necessary, and describe how to embed a compressed perfect skip list in an inverted list. We provide statistical models that explain the empirical distribution of the skip data we observe in our experiments, and use them to devise good compression techniques that allow us to limit the waste in space, so that the resulting data structure increases the overall index size by just a few percents, still making it possible to index pointers with a rather fine granularity.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Cache-Sensitive Skip List: Efficient Range Queries on Modern CPUs

Due to ever falling prices and advancements in chip technologies, many of today’s databases can be entirely kept in main memory. However, reusing existing disk-based index structures for managing data in memory leads to suboptimal performance due to inefficient cache usage and negligence of the capabilities of modern CPUs. Accordingly, a number of main-memory optimized index structures have bee...

متن کامل

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval system is fundamentally based on: Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...

متن کامل

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval system is fundamentally based on: Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...

متن کامل

A New Linear-Time Dynamic Dictionary Matching Algorithm

This research presents inverted lists as a new data structure for the dynamic dictionary matching algorithm. The inverted lists structure, which derives from the inverted index, is implemented by the perfect hashing table. The dictionary is constructed in optimal time and the individual patterns can be updated in minimal time. The searching phase scans the given text in a single pass, even in a...

متن کامل

Distributed search based on self-indexed compressed text

Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated into a set of distributed memory processors. This paper proposes an alternative query processing method for this setting, which is based on a combination of self-indexed compressed text an...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005